{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eVufWTY3jRPx"
   },
   "source": [
    "# Detecting Issues in an Audio Dataset with Datalab\n",
    "\n",
    "In this 5-minute quickstart tutorial, we use cleanlab to find label issues in the [Spoken Digit dataset](https://www.tensorflow.org/datasets/catalog/spoken_digit) (it's like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the class labels to predict from the audio).\n",
    "\n",
    "**Overview of what we'll do in this tutorial:**\n",
    "\n",
    "- Extract features from audio clips (.wav files) using a [pre-trained Pytorch model](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb) from HuggingFace that was previously fit to the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.\n",
    "\n",
    "- Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.\n",
    "\n",
    "- Apply cleanlab's `Datalab` audit to these predictions in order to identify which audio clips in the dataset are likely mislabeled.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "Already have a `model`? Run cross-validation to get out-of-sample `pred_probs`, and then run the code below to audit your dataset and identify any potential issues.\n",
    "\n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab import Datalab\n",
    "\n",
    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
    "lab.find_issues(pred_probs=your_pred_probs, issue_types={\"label\":{}})\n",
    "\n",
    "lab.get_issues(\"label\")\n",
    "    \n",
    "```\n",
    "    \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "eqsqBq3PiUHA"
   },
   "source": [
    "## 1. Install dependencies and import them\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "i7nT-U9qc8MS"
   },
   "source": [
    "You can use `pip` to install all packages required for this tutorial as follows:\n",
    "\n",
    "```ipython3\n",
    "!pip install tensorflow==2.12.1 tensorflow_io==0.32.0 huggingface_hub==0.17.0 speechbrain==0.5.13 \n",
    "!pip install \"cleanlab[datalab]\"\n",
    "# Make sure to install the version corresponding to this tutorial\n",
    "# E.g. if viewing master branch documentation:\n",
    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs website).\n",
    "# Package versions used: tensorflow==2.12.1 tensorflow-io==0.32.0 torch==2.1.2 torchaudio==2.1.2 speechbrain==0.5.13\n",
    "\n",
    "dependencies = [\"cleanlab\", \"tensorflow==2.12.1\", \"tensorflow_io==0.32.0\", \"huggingface_hub==0.17.0\", \"speechbrain==0.5.13\", \"datasets\"]\n",
    "\n",
    "# Supress outputs that may appear if tensorflow happens to be improperly installed: \n",
    "import os \n",
    "os.environ[\"TF_CPP_MIN_LOG_LEVEL\"] = \"3\" \n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "x-oboEbRdhf6"
   },
   "source": [
    "Let's import some of the packages needed throughout this tutorial.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LaEiwXUiVHCS"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import random\n",
    "import tensorflow as tf\n",
    "import torch\n",
    "\n",
    "from cleanlab import Datalab\n",
    "\n",
    "SEED = 456  # ensure reproducibility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# This (optional) cell is hidden from docs.cleanlab.ai \n",
    "\n",
    "def set_seed(seed=0):\n",
    "    \"\"\"Ensure reproducibility.\"\"\"\n",
    "    np.random.seed(seed)\n",
    "    torch.manual_seed(seed)\n",
    "    torch.backends.cudnn.deterministic = True\n",
    "    torch.backends.cudnn.benchmark = False\n",
    "    torch.cuda.manual_seed_all(seed)\n",
    "\n",
    "\n",
    "set_seed(SEED)\n",
    "pd.options.display.max_colwidth = 500\n",
    "tf.get_logger().setLevel('FATAL')  # suppress more TF logs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SOen_sxQidLC"
   },
   "source": [
    "## 2. Load the data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "uHVskN2eeNj6"
   },
   "source": [
    "We must first fetch the dataset. To run the below command, you'll need to have `wget` installed; alternatively you can manually navigate to the link in your browser and download from there.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "GRDPEg7-VOQe",
    "outputId": "cb886220-e86e-4a77-9f3a-d7844c37c3a6"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz\n",
    "!mkdir spoken_digits\n",
    "!tar -xf v1.0.9.tar.gz -C spoken_digits"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tRvNnyB0e_IE"
   },
   "source": [
    "The audio data are .wav files in the `recordings/` folder. Note that the label for each audio clip (i.e. digit from 0 to 9) is indicated in the prefix of the file name (e.g. `6_nicolas_32.wav` has the label 6). If instead applying cleanlab to your own dataset, its classes should be represented as integer indices 0, 1, ..., num_classes - 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "FDA5sGZwUSur",
    "outputId": "0cedc509-63fd-4dc3-d32f-4b537dfe3895"
   },
   "outputs": [],
   "source": [
    "DATA_PATH = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/\"\n",
    "\n",
    "# Get list of .wav file names\n",
    "# os.listdir order is nondeterministic, so for reproducibility,\n",
    "# we sort first and then do a deterministic shuffle\n",
    "file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(\".wav\"))\n",
    "random.Random(SEED).shuffle(file_names)\n",
    "\n",
    "file_paths = [os.path.join(DATA_PATH, name) for name in file_names]\n",
    "\n",
    "# Check out first 3 files\n",
    "file_paths[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Xi2592bVhSab"
   },
   "source": [
    "Let's listen to some example audio clips from the dataset. We introduce a `display_example` function to process the .wav file so we can listen to it in this notebook (can skip these details)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details><summary>See the implementation of `display_example` **(click to expand)**</summary>\n",
    "\n",
    "```python\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "import tensorflow_io as tfio\n",
    "from pathlib import Path\n",
    "from IPython import display\n",
    "\n",
    "# Utility function for loading audio files and making sure the sample rate is correct.\n",
    "@tf.function\n",
    "def load_wav_16k_mono(filename):\n",
    "    \"\"\"Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio.\"\"\"\n",
    "    file_contents = tf.io.read_file(filename)\n",
    "    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)\n",
    "    wav = tf.squeeze(wav, axis=-1)\n",
    "    sample_rate = tf.cast(sample_rate, dtype=tf.int64)\n",
    "    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)\n",
    "    return wav\n",
    "\n",
    "\n",
    "def display_example(wav_file_name, audio_rate=16000):\n",
    "    \"\"\"Allows us to listen to any wav file and displays its given label in the dataset.\"\"\"\n",
    "    wav_file_example = load_wav_16k_mono(wav_file_name)\n",
    "    label = Path(wav_file_name).parts[-1].split(\"_\")[0]\n",
    "    print(f\"Given label for this example: {label}\")\n",
    "    display.display(display.Audio(wav_file_example, rate=audio_rate))\n",
    "```\n",
    "\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import tensorflow_io as tfio\n",
    "from pathlib import Path\n",
    "from IPython import display\n",
    "\n",
    "# Utility function for loading audio files and making sure the sample rate is correct.\n",
    "@tf.function\n",
    "def load_wav_16k_mono(filename):\n",
    "    \"\"\"Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio.\"\"\"\n",
    "    file_contents = tf.io.read_file(filename)\n",
    "    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)\n",
    "    wav = tf.squeeze(wav, axis=-1)\n",
    "    sample_rate = tf.cast(sample_rate, dtype=tf.int64)\n",
    "    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)\n",
    "    return wav\n",
    "\n",
    "\n",
    "def display_example(wav_file_name, audio_rate=16000):\n",
    "    \"\"\"Allows us to listen to any wav file and displays its given label in the dataset.\"\"\"\n",
    "    wav_file_example = load_wav_16k_mono(wav_file_name)\n",
    "    label = Path(wav_file_name).parts[-1].split(\"_\")[0]\n",
    "    print(f\"Given label for this example: {label}\")\n",
    "    display.display(display.Audio(wav_file_example, rate=audio_rate))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2bLlDRI6hzon"
   },
   "source": [
    "Click the play button below to listen to this example .wav file. Feel free to change the `wav_file_name_example` variable below to listen to other audio clips in the dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "id": "dLBvUZLlII5w",
    "outputId": "c6a4917f-4a82-4a89-9193-415072e45550"
   },
   "outputs": [],
   "source": [
    "wav_file_name_example = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav\"  # change this to hear other examples\n",
    "display_example(wav_file_name_example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-QvbZA7yHwkh"
   },
   "source": [
    "## 3. Use pre-trained SpeechBrain model to featurize audio\n",
    "\n",
    "The [SpeechBrain](https://github.com/speechbrain/speechbrain) package offers many Pytorch neural networks that have been pretrained for speech recognition tasks. Here we instantiate an audio feature extractor using SpeechBrain's `EncoderClassifier`. We'll use the \"spkrec-xvect-voxceleb\" network which has been pre-trained on the [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/) speech dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vL9lkiKsHvKr"
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "from speechbrain.pretrained import EncoderClassifier\n",
    "\n",
    "feature_extractor = EncoderClassifier.from_hparams(\n",
    "  \"speechbrain/spkrec-xvect-voxceleb\",\n",
    "  # run_opts={\"device\":\"cuda\"}  # Uncomment this to run on GPU if you have one (optional)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vXlE6IK4ibcr"
   },
   "source": [
    "Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).\n",
    "\n",
    "For this tutorial, ensure that you have `ffmpeg` installed on your system. This is the backend used for loading the audio files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 143
    },
    "id": "obQYDKdLiUU6",
    "outputId": "4e923d5c-2cf4-4a5c-827b-0a4fea9d87e4"
   },
   "outputs": [],
   "source": [
    "# Create dataframe with .wav file names\n",
    "df = pd.DataFrame(file_paths, columns=[\"wav_audio_file_path\"])\n",
    "df[\"label\"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split(\"_\")[0]))\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "I8JqhOZgi94g"
   },
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "\n",
    "def extract_audio_embeddings(model, wav_audio_file_path: str) -> tuple:\n",
    "    \"\"\"Feature extractor that embeds audio into a vector.\"\"\"\n",
    "    signal, fs = torchaudio.load(wav_audio_file_path, backend=\"ffmpeg\")  # Reformat audio signal into a tensor\n",
    "    embeddings = model.encode_batch(\n",
    "        signal\n",
    "    )  # Pass tensor through pretrained neural net and extract representation\n",
    "    return embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2FSQ2GR9R_YA"
   },
   "outputs": [],
   "source": [
    "# Extract audio embeddings\n",
    "embeddings_list = []\n",
    "for i, file_name in enumerate(df.wav_audio_file_path):  # for each .wav file name\n",
    "    embeddings = extract_audio_embeddings(feature_extractor, file_name)\n",
    "    embeddings_list.append(embeddings.cpu().numpy())\n",
    "\n",
    "embeddings_array = np.squeeze(np.array(embeddings_list))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dELkcdXgjTn_"
   },
   "source": [
    "Now we have our features in a 2D numpy array. Each row in the array corresponds to an audio clip. We're now able to represent each audio clip as a 512-dimensional feature vector!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kAkY31IVXyr8",
    "outputId": "fd70d8d6-2f11-48d5-ae9c-a8c97d453632"
   },
   "outputs": [],
   "source": [
    "print(embeddings_array)\n",
    "print(\"Shape of array: \", embeddings_array.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o4RBcaARmfVG"
   },
   "source": [
    "## 4. Fit linear model and compute out-of-sample predicted probabilities\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "y9BIVyI9kHa4"
   },
   "source": [
    "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted network embeddings.\n",
    "\n",
    "To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint that should be considered. However these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to only be used with **out-of-sample** predicted probabilities, i.e. on datapoints held-out from the model during the training.\n",
    "\n",
    "K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset, by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via the [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) wrapper provided in scikit-learn.\n",
    "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "i_drkY9YOcw4"
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_predict\n",
    "\n",
    "model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-2, random_state=SEED)\n",
    "\n",
    "num_crossval_folds = 5  # can decrease this value to reduce runtime, or increase it to get better results\n",
    "pred_probs = cross_val_predict(\n",
    "    estimator=model, X=embeddings_array, y=df.label.values, cv=num_crossval_folds, method=\"predict_proba\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FW1yI9Ryrfkj"
   },
   "source": [
    "For each audio clip, the corresponding predicted probabilities in `pred_probs` are produced by a copy of our `LogisticRegression` model that has never been trained on this audio clip. Hence we call these predictions _out-of-sample_. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "_b-AQeoXOc7q",
    "outputId": "15ae534a-f517-4906-b177-ca91931a8954"
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "predicted_labels = pred_probs.argmax(axis=1)\n",
    "cv_accuracy = accuracy_score(df.label.values, predicted_labels)\n",
    "print(f\"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SPz8WBwIlxUE"
   },
   "source": [
    "## 5. Use cleanlab to find label issues\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "laui-jXMm6qR"
   },
   "source": [
    "Based on the given labels, out-of-sample predicted probabilities and features, cleanlab can quickly help us identify label issues in our dataset. For a dataset with N examples from K classes, the labels should be a 1D array of length N and predicted probabilities should be a 2D (N x K) array. \n",
    "\n",
    "Here, we use cleanlab to find potential label errors in our data. `Datalab` has several ways of loading the data. In this case, we can just pass the DataFrame created above to instantiate the object. We will then pass in the predicted probabilites to the `find_issues()` method so that Datalab can use them to find potential label errors in our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lab = Datalab(df, label_name=\"label\")\n",
    "lab.find_issues(pred_probs=pred_probs, issue_types={\"label\":{}})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can view the results of running Datalab by calling the `report` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "lab.report()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe from the report that cleanlab has found some label issues in our dataset. Let us investigate these examples further.\n",
    "\n",
    "We can view the more details about the label quality for each example using the `get_issues` method, specifying `label` as the issue type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "label_issues = lab.get_issues(\"label\")\n",
    "label_issues.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled).\n",
    "\n",
    "We can then filter for the examples that have been identified as a label error:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "identified_label_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n",
    "lowest_quality_labels = identified_label_issues.sort_values(\"label_score\").index\n",
    "\n",
    "print(f\"Here are indices of the most likely errors: \\n {lowest_quality_labels.values}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iI07jQ0BnTgt"
   },
   "source": [
    "These examples flagged by cleanlab are those worth inspecting more closely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 237
    },
    "id": "FQwRHgbclpsO",
    "outputId": "fee5c335-c00e-4fcc-f22b-718705e93182"
   },
   "outputs": [],
   "source": [
    "df.iloc[lowest_quality_labels]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PsDmd5WDnZJG"
   },
   "source": [
    "Let's listen to some audio clips below of label issues that were identified in this list.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "p9jLn3Lp85rU"
   },
   "source": [
    "In this example, the given label is **6** but it sounds like **8**.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "id": "ff1NFVlDoysO",
    "outputId": "8141a036-44c1-4349-c338-880432513e37"
   },
   "outputs": [],
   "source": [
    "wav_file_name_example = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav\"\n",
    "display_example(wav_file_name_example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "HwokyN0bfVsn"
   },
   "source": [
    "In the three examples below, the given label is **6** but they sound quite ambiguous.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "id": "GZgovGkdiaiP",
    "outputId": "d76b2ccf-8be2-4f3a-df4c-2c5c99150db7"
   },
   "outputs": [],
   "source": [
    "wav_file_name_example = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav\"\n",
    "display_example(wav_file_name_example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "id": "lfa2eHbMwG8R",
    "outputId": "6627ebe2-d439-4bf5-e2cb-44f6278ae86c"
   },
   "outputs": [],
   "source": [
    "wav_file_name_example = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav\"\n",
    "display_example(wav_file_name_example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wav_file_name_example = \"spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav\"\n",
    "display_example(wav_file_name_example)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-rf8iSngtV83"
   },
   "source": [
    "You can see that even widely-used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by cleanlab.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "highlighted_indices = [1946, 516, 469, 2132]  # verify these examples were found in find_label_issues\n",
    "if not all(x in lowest_quality_labels for x in highlighted_indices):\n",
    "    raise Exception(\"Some highlighted examples are missing from label_issues_indices.\")"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "collapsed_sections": [],
   "name": "audio_quickstart_tutorial_deterministic.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
